
 video action recognition







3776558654d8db1bfcb9ebde0e01184e-Supplemental-Conference.pdf

Neural Information Processing Systems

We thus add more parameters in the head network and see if this could close the gap. As UPerNet has an FPN-like head network, we add parameters by replacing FPN with BiFPN. From this figure, we can observe that the features across heads in the Transformer decoder are almost the same. Semantic Segmentation on ADE20K: For the semantic segmentation task, we adopt the widely used ADE20K [11] as the benchmark. Table 7: Hyperparameters for the frozen setting and full finetuning on Kinetics-400 video action recognition.




CAST: Cross-Attention in Space and Time for Video Action Recognition

Neural Information Processing Systems

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-Kitchens-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics. The code is available at https://github.com/KHU-VLL/CAST.
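As a rough illustration of the information exchange described above, the following is a minimal sketch of two-way cross-attention between a spatial and a temporal token stream. It is not the CAST implementation: the function names (`cross_attention`, `exchange`) are hypothetical, and the sketch assumes a single attention head, no learned projections, and no bottleneck (dimension-reducing) layers, all of which a real model would likely include.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over the context tokens.

    Assumption: keys and values are the raw context tokens, i.e. there
    are no learned query/key/value projections in this sketch.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def exchange(spatial, temporal):
    """Hypothetical two-way exchange: each stream queries the other.

    The attended features are added back to the querying stream as a
    residual, so each stream keeps its own features.
    """
    s_from_t = cross_attention(spatial, temporal)
    t_from_s = cross_attention(temporal, spatial)
    new_spatial = [[a + b for a, b in zip(s, u)] for s, u in zip(spatial, s_from_t)]
    new_temporal = [[a + b for a, b in zip(t, u)] for t, u in zip(temporal, t_from_s)]
    return new_spatial, new_temporal
```

In this sketch the two streams may have different numbers of tokens but must share the same feature dimension; a bottleneck variant would first project both streams to a smaller shared dimension and project back afterwards.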


Alignment-guided Temporal Attention for Video Action Recognition

Neural Information Processing Systems

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
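To make the idea of parameter-free patch-level alignment between neighboring frames concrete, here is a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes greedy nearest-neighbor matching by cosine similarity, which may differ from the matching procedure used in the paper (for example, greedy matching can assign the same patch to several positions).

```python
import math


def cosine(u, v):
    """Cosine similarity between two patch feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_patches(reference, frame):
    """For each reference patch, return the index of the most similar
    patch in `frame` (greedy; an assumption of this sketch)."""
    return [
        max(range(len(frame)), key=lambda j: cosine(ref_patch, frame[j]))
        for ref_patch in reference
    ]


def align_frames(frames):
    """Reorder the patches of every frame so that position i follows the
    patch that was at position i in the first frame.

    `frames` is a list of frames; each frame is a list of patch vectors.
    No learned parameters are involved.
    """
    aligned = [frames[0]]
    for frame in frames[1:]:
        # Align each frame to its already-aligned predecessor.
        indices = match_patches(aligned[-1], frame)
        aligned.append([frame[j] for j in indices])
    return aligned
```

After this reordering, 1-dimensional temporal attention can be applied along each patch position, so that each position attends to its matched patches across frames rather than to whatever content happens to share its spatial location.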